BN layer, non-linear activation layer, and max-pooling layer, which we omit for simplicity. The output $\hat{a}_{out}$ is then binarized to $b_{\hat{a}_{out}}$ by the sign function. The fundamental objective of BNNs is to calculate $\hat{w}$ such that the weights before and after binarization are as close as possible, minimizing the binarization effect. Following [77], we define the reconstruction error as
$$\mathcal{L}_R(\hat{w}, \beta) = \left\|\hat{w} - \beta \circ b_{\hat{w}}\right\|_2^2. \tag{4.23}$$
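To make Eq. 4.23 concrete, below is a minimal PyTorch sketch of sign binarization with a channel-wise scale; `binarize` and `reconstruction_error` are illustrative names, and choosing $\beta$ as the per-channel mean of absolute values follows the closed-form minimizer of the $\ell_2$ reconstruction error given in [77].

```python
import torch

def binarize(w_hat: torch.Tensor):
    """Binarize real-valued weights w_hat (out_ch, in_ch, kH, kW) with a
    channel-wise scale beta, as in Eq. 4.23. For the L2 reconstruction
    error, the optimal beta is the mean absolute value per output
    channel, the XNOR-Net solution [77]."""
    b_w = torch.sign(w_hat)   # in {-1, 0, +1}; BNNs usually map 0 to +1
    beta = w_hat.abs().mean(dim=(1, 2, 3), keepdim=True)  # closed-form scale
    return b_w, beta

def reconstruction_error(w_hat, b_w, beta):
    """L_R(w_hat, beta) = ||w_hat - beta o b_w||_2^2 (Eq. 4.23)."""
    return ((w_hat - beta * b_w) ** 2).sum()

w_hat = torch.randn(64, 32, 3, 3)  # a toy 1-bit conv weight tensor
b_w, beta = binarize(w_hat)
print(reconstruction_error(w_hat, b_w, beta).item())
```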
Based on the above derivation, the vanilla direct BNAS [36, 114] can be defined as
$$\max_{\hat{w} \in \mathcal{W},\, \hat{\alpha} \in \mathcal{A},\, \beta \in \mathbb{R}^+} f_b(\hat{w}, \hat{\alpha}, \beta), \tag{4.24}$$
where $b_{\hat{w}} = \mathrm{sign}(\hat{w})$ is used for inference and $\hat{\alpha}$ is a neural architecture with binary weights. The prior direct BNAS [36] learns the architecture from an objective of the form
$$\max_{\hat{w} \in \mathcal{W},\, \hat{\alpha} \in \mathcal{A},\, \beta \in \mathbb{R}^+} \tilde{f}_b(\hat{w}, \hat{\alpha}, \beta) = \sum_{n=1}^{N} \hat{p}_n(\hat{w}, \hat{\alpha}, \beta) \log\big(\hat{p}_n(X)\big), \tag{4.25}$$
where we use notations similar to those of Eq. 4.21. Equation 4.25 shows that the vanilla direct BNAS focuses only on the binary search space under the supervision of the cross-entropy loss, which is less effective because the search process is not exhaustive [24].
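The relaxation behind Eqs. 4.24 and 4.25 can be sketched in code. The snippet below is a hedged illustration rather than the chapter's implementation: it assumes a DARTS-style soft mixture over candidate operations, binarizes the latent weights $\hat{w}$ with a straight-through estimator, and optimizes the architecture parameters $\hat{\alpha}$ jointly under cross-entropy supervision; `BinaryConv` and `MixedBinaryOp` are hypothetical names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryConv(nn.Module):
    """Conv whose weights are binarized on the forward pass (Eq. 4.23);
    a straight-through estimator passes gradients to the latent weights."""
    def __init__(self, cin, cout):
        super().__init__()
        self.w_hat = nn.Parameter(0.1 * torch.randn(cout, cin, 3, 3))

    def forward(self, x):
        beta = self.w_hat.abs().mean(dim=(1, 2, 3), keepdim=True)
        b_w = torch.sign(self.w_hat)
        # forward uses beta * b_w; backward sees the identity w.r.t. w_hat
        w = self.w_hat + (beta * b_w - self.w_hat).detach()
        return F.conv2d(x, w, padding=1)

class MixedBinaryOp(nn.Module):
    """A DARTS-style mixed edge over binary candidate ops, weighted by
    softmax(alpha_hat): the continuous relaxation of the binary search
    space used in Eqs. 4.24-4.25."""
    def __init__(self, cin, cout, n_ops=3):
        super().__init__()
        self.ops = nn.ModuleList(BinaryConv(cin, cout) for _ in range(n_ops))
        self.alpha_hat = nn.Parameter(1e-3 * torch.randn(n_ops))

    def forward(self, x):
        weights = F.softmax(self.alpha_hat, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# One toy search step: cross-entropy supervision updates both w_hat and alpha_hat.
x = torch.randn(2, 16, 8, 8)
edge = MixedBinaryOp(16, 16)
logits = edge(x).mean(dim=(2, 3))            # toy classification head
loss = F.cross_entropy(logits, torch.randint(0, 16, (2,)))
loss.backward()
```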
4.4.2 Redefine Child-Parent Framework for Network Binarization
Network binarization computes neural networks with 1-bit weights and activations to approximate the full-precision network, which significantly compresses CNNs. Prior work [287] usually investigates the binarization problem by using the full-precision model to guide the optimization of the binarized model. Based on this investigation, we reformulate NAS-based network binarization as a Child-Parent model, as shown in Fig. 4.12: the Child and Parent models are the binarized network and its full-precision counterpart, respectively.
Conventional NAS is inefficient due to the complicated reward computation during network training, where a structure is usually evaluated only after the network training converges. Some methods instead evaluate a cell during training. [292] points out that the best choice in the early stages is not necessarily the final optimal one; however, an operation that performs worst in the early stages usually keeps performing badly, and this phenomenon becomes more significant as training proceeds. Based on this observation, we propose a simple yet effective operation-removing process (sketched below), which is the crucial task of the proposed CP model.
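A minimal sketch of such an operation-removing step follows. The `edge_scores` bookkeeping and function name are hypothetical; the actual removal criterion in the CP model is the evaluation measure defined next.

```python
def remove_worst_ops(edge_scores, keep_min=1):
    """Progressively drop the worst-scoring candidate operation on each
    edge, following the observation of [292] that an operation that is
    worst in the early search stages rarely recovers later.

    edge_scores: {edge_id: {op_name: performance_score}} -- hypothetical
    bookkeeping collected while training the supernet."""
    for edge, scores in edge_scores.items():
        if len(scores) > keep_min:
            worst = min(scores, key=scores.get)
            del scores[worst]          # shrink the search space
    return edge_scores

scores = {"edge0": {"conv3x3": 0.61, "skip": 0.55, "maxpool": 0.42}}
remove_worst_ops(scores)               # drops "maxpool", the current worst op
```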
Intuitively, the representation difference between the Child and the Parent, and how well the Child can independently handle its task, are the two main aspects that should be considered when defining a reasonable performance evaluation measure. Based on this analysis, we introduce the Child-Parent framework for binary NAS, which defines the objective as
$$\begin{aligned}
\hat{w}^*, \hat{\alpha}^*, \beta^* &= \operatorname*{argmin}_{\hat{w} \in \hat{\mathcal{W}},\, \hat{\alpha} \in \mathcal{A},\, \beta \in \mathbb{R}^+} \mathcal{L}_{\mathrm{CP\text{-}NAS}}\big(\tilde{f}^P(w, \alpha),\, \tilde{f}^C_b(\hat{w}, \hat{\alpha}, \beta)\big) \\
&= \operatorname*{argmin}_{\hat{w} \in \hat{\mathcal{W}},\, \hat{\alpha} \in \mathcal{A},\, \beta \in \mathbb{R}^+} \tilde{f}^P(w, \alpha) - \tilde{f}^C_b(\hat{w}, \hat{\alpha}, \beta),
\end{aligned} \tag{4.26}$$
where $\tilde{f}^P(w, \alpha)$ denotes the performance of the real-valued Parent model as predefined in Eq. 4.21, and $\tilde{f}^C_b$ is further defined as $\tilde{f}^C_b(\hat{w}, \hat{\alpha}, \beta) = \sum_{n=1}^{N} \hat{p}_n(\hat{w}, \hat{\alpha}, \beta) \log(\hat{p}_n(X))$ following Eq. 4.25. As shown in Eq. 4.26, we propose $\mathcal{L}_{\mathrm{CP\text{-}NAS}}$ to estimate the performance of candidate architectures.
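As one possible reading of Eq. 4.26, the sketch below measures both $\tilde{f}^P$ and $\tilde{f}^C_b$ by the negative cross-entropy on a batch and penalizes the Parent-Child performance gap; treating the Parent as a fixed teacher (detached) is our assumption, not something Eq. 4.26 itself prescribes.

```python
import torch.nn.functional as F

def model_fitness(logits, labels):
    """A proxy for f-tilde in Eqs. 4.21/4.25: the negative cross-entropy
    of the model's predictions, so higher means better performance."""
    return -F.cross_entropy(logits, labels)

def cp_nas_loss(parent_logits, child_logits, labels):
    """L_CP-NAS (Eq. 4.26): the performance gap between the real-valued
    Parent and the 1-bit Child. With the Parent detached as a fixed
    teacher, gradients flow only into the Child, pushing its fitness
    toward the Parent's."""
    f_parent = model_fitness(parent_logits, labels).detach()
    f_child = model_fitness(child_logits, labels)
    return f_parent - f_child
```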